BELL, BOYD & LLOYD PLL Fax: 2024630678 



May 19 2008 16:36 



P. 06 



(19) Europäisches Patentamt
European Patent Office
Office européen des brevets

(11) EP 0 950 965 A2

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:

(51) Int. Cl.6: G06F 17/30

(21) Application number: 99302864.6
(22) Date of filing: 13.04.1999

(84) Designated Contracting States:
AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
Designated Extension States:
AL LT LV MK RO SI

(30) Priority: 14.04.1998 US 59ddS

(71) Applicant: XEROX CORPORATION
Rochester, New York 14644 (US)

(72) Inventors:
• Schilit, William N.
Menlo Park, California 94029 (US)
• Price, Morgan H
Palo Alto, California 94306 (US)
• Golovchinsky, Gene
Palo Alto, California 94306 (US)

(74) Representative: Skone James, Robert Edmund
GILL JENNINGS & EVERY
Broadgate House
7 Eldon Street
London EC2M 7LH (GB)

(54) Method and system for document storage management based on document content



(57) A document storage management system and method that manages the storage of documents based upon the similarity of the content of the documents. Groups of documents are created based upon the similarity of the contents of the documents. Those groups are displayed to the user in a ranked list of selectable groups to permit selection of a group or document. The storage of the selected group or document is then managed by, for example, deleting, compressing, or copying. The displayed list may be ranked based upon a least recently used policy, the relevance to a predetermined topic, the size of the group, the radius of the group based upon the maximum distance of any document from the group centroid, the number of documents in the group and any other combination of parameters.

Description

[0001] This invention is directed to a method and a system for managing the storage of documents. More particularly, this invention is directed to a method and a system for managing documents stored in a limited capacity storage device based on the content of the stored documents.

[0002] Information access plays a key role across an ever expanding range of leisure and work activities. Laptop, hand-held and palmtop computers are being reinvented as "portable information appliances" with the promise of information access anytime and anywhere. However, users of these devices have become reliant on continuous, high speed, low-cost networks available to computers, and portable computers are rarely connected to such networks. One technique that makes portable computers less dependent upon networking is a cache.

[0003] A cache is generally defined as a fast access memory that stores a copy of frequently referenced information. A cache reduces the reliance of the computer on a connection to a network and can provide documents even when disconnected from a network. Cache management systems control the information that is stored in a cache.

[0004] An important part of a cache management system is the replacement policy. The replacement policy governs which items will be removed from the cache when the cache fills up, that is, when there is insufficient space in the cache to store a new item. A replacement policy requires inherently difficult decisions because each decision involves predicting the future. The accuracy of those predictions can only be measured after several requests for information have been provided to the cache. The accuracy of the predictions is measured by the efficiency of the replacement policy. The efficiency of a replacement policy for a cache is determined by the ratio of hits (items found in the cache) to misses (items missing from the cache).
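
The hit-to-miss ratio described above can be measured by replaying a sequence of requests against a cache. The following is a minimal sketch, assuming a least-recently-used replacement policy chosen purely for illustration; the function names `simulate_lru` and `hit_to_miss_ratio` are hypothetical and do not come from the application.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity):
    """Replay requests against a fixed-size cache and count hits and misses.

    Assumption: least-recently-used eviction; the text does not prescribe
    a particular replacement policy at this point.
    """
    cache = OrderedDict()
    hits = misses = 0
    for item in requests:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used item
            cache[item] = True
    return hits, misses


def hit_to_miss_ratio(hits, misses):
    """Efficiency as the ratio of hits to misses."""
    return hits / misses if misses else float("inf")
```

The efficiency can only be computed after the requests have been served, which matches the observation that the accuracy of a replacement policy is measurable only in hindsight.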

[0005] The efficiency of a cache management system takes on greater importance for portable information appliances. When a computer is connected over a wireless network and a cache miss occurs, it may not only take a relatively long time to locate and download the missing document, but the miss may also result in expensive fees for the network connection usage for the download. If the cache had the requested document in local storage, there would be no need to download the document from a network and expensive access charges would be avoided. Moreover, when a system is disconnected from a network, a cache miss may stop the user from continuing work. In an attempt to solve these and other problems with cache management systems, researchers have been investigating new replacement policies that better predict requests for documents or files.
[0006] One attempt to solve these problems involves augmenting a computational replacement policy with direct user interaction. This is based on the assumption that people often know which is the most important information to keep in local storage. Two systems, TeleWeb and Mowgli, are mobile web browsers that allow users to lock documents into the local storage. TeleWeb and Mowgli are described in "TeleWeb: Loosely Connected Access to the World-Wide Web", W.N. Schilit et al., Computer Networks and ISDN Systems, 28, pp. 1431-1444, 1996, and "Optimizing World-Wide Web for Weakly Connected Workstations: An Indirect Approach", T. Alanko et al., Proceedings of the 2nd International Workshop on Services in Distributed and Networked Environments (SDNE'95), June 5-6, 1995, respectively, and are incorporated herein by reference in their entireties. In TeleWeb, the storage contents are exposed to the user in a file by file listing and users may pin or lock items, or delete items. One problem with this type of user control is that the users tend to lock more than they delete. Therefore, the local storage becomes much less effective because it becomes filled with locked, but no longer relevant, documents.

[0007] Another approach [...] and which documents are appropriate to keep. Such an approach is described in "How to Program Networked Portable Computers", D. Goldberg et al., Proceedings of the Fourth Workshop on Workstation Operating Systems, pp. 30-33, October 1993, incorporated herein by reference in its entirety. When replacement is necessary, the system may, for example, use a pop-up dialog box to ask for suggestions on which file to remove from storage. A pop-up dialog box is shown in Fig. 1. Fig. 1 shows a Graphical User Interface (GUI) 10 displaying a list of documents 12. The GUI 10 allows the user to select a file to discard based on the title 14.

[0008] The list may be sorted for the user according to the titles 14 of the documents, the sizes of the documents or the dates of last access. The appropriate sorting is invoked when the user clicks the heading for the appropriate column. One problem with this approach is that the user is forced to explicitly select documents. A user must explicitly designate those documents that the user wishes to manage. For example, a user is forced to formulate a search query to retrieve the appropriate documents. Therefore, freeing up storage space of any significant amount requires multiple interactions. Furthermore, it is often difficult to determine a file's importance from only its file name, size or date of last access.

[0009] Some storage management systems try to predict file accesses by analyzing the history of file accesses. Two systems have been implemented that allow users to turn on recording of file accesses. The systems are described in "Detection and Exploitation of File Working Sets", C. Tait et al., Proceedings of the 11th International Conference on Distributed Computing Systems, May 1991, pp. 2-9, and "Disconnected Operation in a Distributed File System", James Kistler, (1993) Ph.D. Thesis, School of Computer Science, on file with the Carnegie Mellon University Library, incorporated herein by reference in their entireties. Traces of file accesses are used by these systems to hoard or prefetch files into the local storage. More recently, a system described in "Intelligent File Hoarding for Mobile Computers", C. Tait et al., MOBICOM 95, pp. 110-125, ACM, Inc., incorporated herein by reference in its entirety, extends this concept to a graphic interface that permits users to select which of a number of profiles to hoard. These profiles are created by observing patterns of file accesses to both application files and data files. However, this style of storage management only works well [...]

[0010] A number of systems have used automatic techniques to present groups of related documents to a user. One technique characterizes clusters of documents with keywords and titles of representative documents in order to support browsing or full-text retrieval. This technique is described in "Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections", D.R. Cutting et al., Proceedings of the 16th Annual International ACM/SIGIR Conference, 1992, ACM, Inc., and "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results", M.A. Hearst et al., Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich 1996, incorporated herein by reference in their entireties.

[0011] Another system presents clusters of Web [...] and allows the user to expand the contents of clusters. Such a system is described in "Automatically Organizing Bookmarks per Contents", Y.S. Maarek et al., Computer Networks and ISDN Systems 28, pp. 1321-1333, 1996, incorporated herein by reference in its entirety. However, these systems do not provide facilities for deleting, compressing, or taking other management actions on groups of documents.

[0012] None of the above systems works well for [...] that cannot be neatly organized and that do not have a [...]

[0013] A document management system is needed that augments the advantages of direct user interaction with a minimally invasive technique, does not fill up the cache [...] patterns.

[0014] This invention provides a method and a system that allow users to manage storage space by choosing among groups of documents having similar content. The content of the documents may be presented to the user in groups identified by topics. Grouping documents allows the user to free up space quickly by avoiding expensive file-by-file decisions. Grouping documents also permits the user to determine which files need to be stored locally, either for speed of access or to avoid interruptions when subsequently operating in a disconnected state, based upon the topics for which the user anticipates a future need. For example, Fig. 2 shows a GUI 20 according to one embodiment of this invention. The GUI 20 presents three groups from which the user may choose. Each group has a topic that may characterize the documents collected into that group. For example, as shown in Fig. 2, documents are collected into three groups, a first group 22 entitled "95 Windows Microsoft NT network workstation", a second group 24 entitled "mobile wireless network computing workstation system" and a third group 26 entitled "developer HTML web CGI Microsoft Program URL". Thus, the first group 22 includes documents related to networking with Windows 95 and NT, the second group 24 includes documents related to issues in mobile computing and wireless networks and the third group 26 includes documents related to developing web applications.

[0015] The user can determine which files (i.e. documents) to remove, prefetch or hoard based upon these groups of documents that have similar content. The user can quickly select a group of documents having similar, but no longer relevant, content, delete that group, and free up a large amount of space by removing the entire group. The user may also anticipate which groups of documents [...] external source into the local storage based upon the content of those documents.

[0016] The user may rely solely upon the determination of the similarity of the content of the documents within the groups or may additionally rely upon the determination of the date and time of last access, the relevance of the groups to a predetermined topic, the total size of the group, the radius of the group, and/or the number of documents in the groups. Additionally, the groups may be ranked by the date and time of last access to the documents in the group. The groups may be, alternatively, ranked in accordance with any number or combination of group characteristics.

[0017] The grouping of documents can also be used as a guideline for making document storage decisions. For example, a group may be selected out of a listing of groups and the selected group may be expanded into a list of documents from the selected group. Preferably, the document list is ranked in accordance with predetermined or user defined attributes. The user is then free to select one, several or all of the documents within the group for a storage management process.

[0018] These and other features and advantages of this invention are described in or are apparent from the following detailed description of the preferred embodiments.

[0019] The preferred embodiments of this invention will be described in detail, with reference to the following figures, wherein:

Fig. 1 is a graphical user interface of a conventional document storage management system;

Fig. 2 is a graphical user interface showing documents grouped by the similarity of content of the documents in accordance with this invention;

Fig. 3 is a block diagram of an information system using the document storage management system of one embodiment of this invention;

Fig. 4 is a flow chart outlining the operation of one embodiment of the document storage management system of this invention;

Fig. 5 is a flow chart outlining document similarity grouping [...]

Fig. 6 is a block diagram [...]



[0020] Fig. 3 shows a block diagram of an information system 30 operating with the document management system in accordance with this invention. A cache 32 may be a portion of a local storage device such as a hard drive 34 or a memory device 36. In any case, the cache 32 is a local storage buffer that stores documents that are anticipated to be accessed by the information system 30.

[0021] It should be appreciated that the document management system of this invention may extend to many types of information storage systems regardless of whether they are portable or non-portable. It should also be appreciated that although this description generally refers only to "documents", the term "document" also encompasses any file or object that may be grouped with other files based upon the similarity of the content of those files or objects. A "file" is intended to include any unit of information. "Content" is intended to include any divisible or identifiable portion of a document such as abstracts, [...], titles or [...]

[0022] It is to be understood that the term "document" is intended to include text, video, audio and any other medium and any combination of media. Further, it is to be understood that the term "text" is intended to include text, digital ink in stroke or bitmap format, audio, images, video or any other structure or content of a document.

[0023] The information system 30 includes a processor 38 running software that embodies the document management system and which controls the document management system. A communication interface 40 allows the information system 30 to communicate with an external data source 42, such as another computer, a network, a file server, or a removable medium such as a portable hard disk or CDROM. The communication channel 44 established between the communication interface 40 and the external data source 42 may be interrupted or disconnected for a variety of reasons. The communication channel 44 may be expensive to maintain or the information system 30 may be a portable information system that is only intermittently connected to the external data source 42.

[0024] The processor 38 communicates with an input/output [...] conventional input/output devices, such as a mouse 48, a keyboard 50, a display 52, and/or a pen 54. While Fig. 3 only shows a display 52, it is to be understood that any type of presentation device that is appropriate for the type of document is intended.

[0025] The document management system manages the documents in the local storage device 34 by selectively removing documents from the local storage device 34 to create additional space in the local storage device 34. The document management system may also communicate through the communication channel 44 to the external data source 42 to download documents and store documents in the local storage device 34 in anticipation of future requirements.

[0026] As shown in Fig. 3, the system 30 is preferably implemented using a programmed general purpose computer. However, the system 30 can also be implemented using a special purpose computer, a programmed microprocessor or microcontroller and any necessary peripheral integrated circuit elements, an ASIC or other integrated circuit, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA or PAL, or the like. In general, any device that supports a finite state machine capable of implementing the flowcharts shown in Figs. 4 and 5 can be used to implement the system 30.

[0027] Additionally, as shown in Fig. 3, the memory 36 is preferably implemented using static or dynamic RAM. However, the local storage device 34 can also be implemented using a floppy disk and disk drive, a writable optical disk and disk drive, flash memory or the like. Additionally, it should be appreciated that the local storage device 34 can be either distinct portions of a single memory or physically distinct memories. This is also the case with the memory 36.

[0028] Furthermore, it should be appreciated that the links 44, 37 and 39 connecting the external data source 42 to the communication interface 40 and the processor 38 to the memory 36 and the hard drive 34, respectively, can be wired or wireless links to a network (not shown). The network can be a local area network, a wide area network, an intranet, the Internet or any other distributed processing and storage network. In this case, the electronic data is pulled from a physically remote external data source 42, the memory 36 or hard drive 34 through the links 44, 37 or 39 for processing in the processor 38 according to the method outlined below. Some portions of or the entire electronic document 22 can be stored locally in a portion of the memory 36, hard drive 34 or some other memory (not shown) of the system 30.

[0029] Fig. 4 is a flow chart of the operation of one embodiment of the control routine of the document management system of this invention. The control routine starts at step S100, and continues to step S110 where it receives a request to group the documents. The documents being grouped may be located in the cache 32 or in the external data source 42. The local documents are grouped when more storage space is needed in the cache 32 and the documents in the external data source 42 are grouped when the information system 30 is loading more documents into the cache 32. Grouping the documents relies upon procedures that are well known in the art. These procedures group documents based upon the similarity of the content of the documents. An example of such a procedure is disclosed in "Automatic Text Processing", by Gerard Salton, pp. 303-309, Addison-Wesley Publishing Company, Inc. (1989), which is incorporated herein by reference in its entirety.

[0030] The specific method chosen for grouping the documents is not crucial, because grouping can take place in the background. Therefore, the speed of the grouping method may not be important. The only requirement is that the system groups the documents based upon the similarity of the content in the documents. It is to be understood that the term "similarity" is intended to include any measure of a document or a portion of a document's relatedness or relevance to another document or portion of a document. The similarity may be calculated, predetermined or determined by a user. The documents may be grouped by any conventional clustering algorithm. Examples of clustering algorithms include algorithms that cluster based on the size, radius of the document, the time of a document's creation, last edit or last access or any other attribute of a document. Overlapping of groups may be allowed or prohibited. A specific example of a grouping method will be described in detail below.

[0031] The control routine displays an ordered or ranked group listing at step S120. An example of a ranked group list is shown in Fig. 2. The control routine then determines, at step S130, whether a group has been selected by the user. If a group has been selected, the control routine continues to step S140. Otherwise, the control routine jumps to step S190.

[0032] In step S140, the control system determines whether the user has requested that the selected group be expanded. If the control routine determines that the user has requested that the selected group be expanded, then the control routine continues to step S150. Otherwise, the control routine jumps to step S170. In step S150, the document management system displays a ranked list of the documents within the selected group. Fig. 2 shows an expanded listing 29 of the selected group 22.

[0033] If, at step S140, the control system determines that an expand group command has been received, then the control routine proceeds to step S150. At step S150 the control routine displays a ranked list of documents within the selected group. Each document within the group is selectable by a user. If the control system determines at step S160 that a document has been selected, then the control routine proceeds to step S170.

[0034] At step S170, the system determines whether a user has input a command to operate on the selected document or group of documents. If, at step S170, the system receives a command, then the control routine continues to step S180. The command may include but is not limited to a command to delete the selected document or group of documents from the cache or to store the selected document or group of documents from the external data source into the cache. In step S180, the control routine executes the command on the selected document or group of documents. The control routine then proceeds to step S190.

[0035] Alternatively, if at step S160, no document is selected, then the control routine continues to step S190. Similarly, if at step S170, the control system does not receive a command, then the control routine also jumps to step S190. In step S190, the control routine stops.

[0036] In step S170, the user of the method and the system of this invention may choose from any number of storage management commands. Examples of storage management commands include deletion, compression, and copying of the selected document or group of documents.
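
As an illustration of how a selected command might be executed on a selected document or group, the following minimal sketch applies the three example commands to an in-memory cache. The dictionary-based cache, the function name `apply_command` and the `copy_target` parameter are assumptions made for this example only; the description does not specify how the commands are implemented.

```python
import zlib


def apply_command(cache, names, command, copy_target=None):
    """Apply a storage management command to the named documents.

    Assumption: `cache` maps document names to bytes. `names` is the
    selection (one document, several, or a whole group).
    """
    for name in names:
        if command == "delete":
            del cache[name]                           # free the space entirely
        elif command == "compress":
            cache[name] = zlib.compress(cache[name])  # keep, but smaller
        elif command == "copy":
            copy_target[name] = cache[name]           # e.g. to another store
        else:
            raise ValueError(f"unknown command: {command}")
```

Because the selection is just a list of names, the same call covers both a single expanded document and an entire group.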

[0037] One method for determining the similarity of documents comprises the method outlined in the flow chart of Fig. 5, which begins in step S300. Next, in step S310, individual words occurring in the documents of a collection are identified. Then, in step S320, a stop list of common function words ("and," "of," "or," "but," "the," and the like) is used to delete high-frequency function words that are insufficiently specific [...]. In step S330, an automatic suffix-stripping routine is used to reduce each remaining word to word-stem form. This routine reduces all words exhibiting the same stem to a common form. Control then continues to step S340.
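
Steps S310 to S330 can be sketched as a small text pipeline. This is a toy illustration only: the stop list contains just the five words quoted above, and the suffix list in `SUFFIXES` is a made-up stand-in for a real suffix-stripping routine, which the description does not specify.

```python
import re

# Stop list taken from the examples quoted in the text.
STOP_WORDS = {"and", "of", "or", "but", "the"}

# Hypothetical, deliberately tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Reduce a word to a crude stem by removing one known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_stems(text):
    """Identify words (S310), drop stop words (S320), strip suffixes (S330)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]
```

With this sketch, "caching" and "cached" both reduce to the same stem, which is the property the grouping relies on.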

[0038] Next, in step S340, a typical document similarity coefficient is obtained. A description of a document similarity coefficient calculation of a method and system of one embodiment of this invention follows. For each remaining word stem $T_j$ and document $D_i$, a weighting factor $W_{ij}$ is determined. This weighting factor $W_{ij}$ includes in part the term frequency and in part the inverse document frequency for the term. For example, the weighting factor $W_{ij}$ is determined as:

$$W_{ij} = tf_{ij} \cdot \log\left(\frac{N}{n_j}\right) \qquad (1)$$

where $tf_{ij}$ is the frequency of the word stem $T_j$ in the document $D_i$, $N$ is the number of documents in the collection and $n_j$ is the number of documents in which the word stem $T_j$ occurs.
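
A weighting factor that combines term frequency with inverse document frequency can be computed as in the sketch below. It assumes the common form in which the weight is the term frequency multiplied by the logarithm of the collection size divided by the number of documents containing the stem; the function name `tfidf_weights` and the list-of-stems input format are choices made for this example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {stem: weight} dictionary per document.

    Assumption: weight = term frequency * log(N / n), where N is the number
    of documents and n the number of documents containing the stem.
    `docs` is a list of lists of word stems.
    """
    n_docs = len(docs)
    # Count each stem once per document to get document frequencies.
    doc_freq = Counter(stem for doc in docs for stem in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {stem: tf * math.log(n_docs / doc_freq[stem]) for stem, tf in term_freq.items()}
        )
    return weights
```

A stem that occurs in every document receives a weight of zero, so it contributes nothing to the similarity between documents.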

[0039] Then, document vectors are formulated. A document vector for each document is represented by the set of word stems together with the corresponding weighting factors:

$$D_i = (T_1, W_{i1};\ T_2, W_{i2};\ \ldots;\ T_t, W_{it}) \qquad (2)$$

[0040] or

$$D_i = (D_{i1}, D_{i2}, \ldots, D_{it}) \qquad (3)$$

where $D_{ij}$ is a combination of the word stem $T_j$ and the inverse document frequency weight $W_{ij}$.

[0041] Finally, the similarity coefficient is calculated between document vectors:

$$\mathrm{Sim}(D_i, D_j) = \sum_{k=1}^{t} D_{ik} D_{jk} \qquad (4)$$

[0042] Eq. (4) can be used with variable-length vectors that exhibit a variable number of included terms. However, when non-normalized document vectors are used, longer documents with more terms have a greater chance of matching than do the shorter document vectors. This, however, does not necessarily produce the best results. The document similarity factor of Eq. (4) is therefore normalized by the lengths of the document vectors:

$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k=1}^{t} D_{ik} D_{jk}}{\sqrt{\sum_{k=1}^{t} D_{ik}^{2} \cdot \sum_{k=1}^{t} D_{jk}^{2}}} \qquad (5)$$

[0043] Eq. (5) represents the cosine of the angle between the document vectors considered as vectors in a space of t dimensions, where t is the number of distinct terms in the system.
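
The difference between the plain inner product and the cosine measure can be seen in a short sketch. Document vectors are represented here as dictionaries mapping word stems to weights, which is an assumption of this example rather than a representation given in the description; the function names are likewise illustrative.

```python
import math


def inner_product(a, b):
    """Sum of products of the weights of the stems shared by both vectors."""
    return sum(weight * b[stem] for stem, weight in a.items() if stem in b)


def cosine_similarity(a, b):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(inner_product(a, a) * inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0
```

Doubling every weight in one vector doubles the inner product but leaves the cosine unchanged, which is why the normalized form does not favor longer documents.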

[0044] After the similarity coefficient is determined between document vectors in step S340, control continues to step S350. In step S350, documents are ranked in decreasing order of similarity. Next, in step S360, groups are formed based on a similarity metric. For example, the documents may be grouped using an algorithm that starts with each document in its own group and iteratively merges groups. A group is merged with another group only if the merger results in a group with the smallest possible radius. That is, for example, if only the first and the second groups remain as potential mergers, then the first group is examined to determine the radius of a merged group that includes the first group and the second group is examined to determine the radius of a merged group that includes the second group. If the radius of the merged group that includes the first group is smaller than the radius of the merged group that includes the second group, then the first group is merged. If there are more than two potential mergers, then all potential mergers are examined and only the merger that results in the smallest radius is performed. This process is repeated until no additional merging is possible without exceeding a radius limit.
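
The merging procedure can be sketched as follows. The sketch assumes that documents are already represented as fixed-length numeric vectors and that Euclidean distance to the centroid is the distance measure; both are simplifications for illustration, and the names `group_documents` and `radius_limit` are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def radius(points):
    """Maximum distance of any element from the centroid of the group."""
    c = centroid(points)
    return max(math.dist(p, c) for p in points)


def group_documents(points, radius_limit):
    """Start with one group per document and repeatedly perform the merger
    that yields the smallest radius, until the limit would be exceeded."""
    groups = [[p] for p in points]
    while len(groups) > 1:
        # Examine every potential merger and keep the smallest resulting radius.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: radius(groups[pair[0]] + groups[pair[1]]),
        )
        merged = groups[i] + groups[j]
        if radius(merged) > radius_limit:
            break  # no further merger is possible within the limit
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups
```

Examining every pair on every pass is quadratic in the number of groups, which is acceptable here because grouping can run in the background.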

[0045] There are several attributes of documents or files that can be selected by the user to create good groups of candidates for removal. Groups that take up more storage space are more beneficial to remove. A group with low relevance that has been removed is less likely to incur a cache miss than a group with high relevance. A "narrow" group (one containing very similar documents) is a good candidate for removal because its narrowness minimizes the risk that users will inadvertently remove important documents because the short summary presented for a group does not resemble some outlying document that happens to be relevant. To assist the user, one embodiment of this invention uses a linear combination of the metrics of: 1) the time since the last access; 2) the relevance to the user's interests; 3) the storage space for the entire group; 4) the number of documents; and 5) the radius of the group, to determine which groups to present to the user.

[0046] The standard cache replacement policy is the Least Recently Used (LRU) policy, where a document's relevance is the inverse of the time since its last access. A document or group having a low relevance, because it has been a long time since its last access, is the best candidate for removal. The system of this invention sets the LRU time for a group of documents to be that of the most recently accessed document.
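
The rule for the LRU time of a group is short enough to state directly in code. In this sketch, last-access times are plain numbers (for example, seconds since some epoch) and groups are given as a dictionary from a group name to the access times of its documents; both are assumptions made for the example.

```python
def group_lru_time(last_access_times):
    """LRU time of a group: that of its most recently accessed document."""
    return max(last_access_times)


def lru_order(groups):
    """Group names ordered from least to most recently used."""
    return sorted(groups, key=lambda name: group_lru_time(groups[name]))
```

One recently accessed document is enough to keep an entire group from appearing early in the removal order.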

[0047] The system may optionally have an evolving query that approximates the user's interests, or some other mechanism for relevance feedback. A relevance feedback process can be used to improve query formulation while altering the original queries in two substantial ways:

1. Terms present in previously retrieved documents that have been identified as relevant to the query are added to the original query formulations; and

2. The weights of the original query terms are altered by replacing the inverse document-frequency portion of the weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and non-relevant documents of the collection.

[0048] Assuming that the initial query is available in the form specified in a manner similar to Eq. (3), a reformulated query will take the following form:

$$Q' = \alpha\,(W_1, W_2, \ldots, W_t) + \beta\,(W'_{t+1}, W'_{t+2}, \ldots, W'_{t+m})$$

where

α and β may take values between 0 and 1; and
the weights of the t initial terms may be generated by taking combined term-frequency and inverse document-frequency weights.

[0049] The weights of the newly added terms t+1 to t+m, on the other hand, may be determined by a combined term frequency and a term-relevance weight.
[0050] With a relevance feedback mechanism, the relevance of the document to the user's interest can be estimated. The system can then base the relevance for a group of documents on the document having the maximum relevance within the group.

[0051] The system may optionally have a user controllable device, such as a GUI interface, for setting the minimum size of the groups. This allows the user to decide how much benefit they require for each interaction with the storage management interface.

[0052] The system may also permit the user to determine the minimum or maximum width of groups that are to be presented. To compute the width of a group, the system uses the radius, which is defined as the maximum distance of any element from the centroid of the group. This metric is preferred to an intra-cluster similarity (e.g., mean square distance from the centroid) because it is more sensitive to outliers, and the outliers are the documents in a group that are the least similar in content to the other documents.
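
The sensitivity of the radius to outliers can be checked numerically. The sketch below assumes Euclidean distance on numeric vectors, which the description does not mandate, and compares the radius with the mean square distance for a group containing a single outlier.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def radius(points):
    """Maximum distance of any element from the centroid."""
    c = centroid(points)
    return max(math.dist(p, c) for p in points)


def mean_square_distance(points):
    """Average squared distance from the centroid (an intra-cluster measure)."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points) / len(points)
```

For ninety-nine identical documents plus one distant outlier, the mean square distance is averaged down by the many close documents, while the radius reports the outlier's distance in full.
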
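[0053] The groups of documents may then be ranked in accordance with any one and/or combination of the previously described metrics or in accordance with [...]. A score for each group can be expressed as:

$$\text{score} = \alpha \cdot (\text{LRU time}) + \beta \cdot (\text{relevance}) + \gamma \cdot (\text{radius}) + \delta \cdot (\#\ \text{of documents}) + \varepsilon \cdot (\text{size})$$

where α, β, γ, δ and ε are user defined constants.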

[0054] When the user is freeing space in the cache, the groups with the lowest scores are presented to the user before groups having higher scores. Alternatively, groups with the highest scores are presented first when the system is being used to select documents to download from an external data source.
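
The two presentation orders can be sketched with a single ranking function. The sketch assumes that the score is a weighted sum of five group metrics (last-access time, relevance, radius, number of documents and size); the `Group` fields, the order of the weights and the function names are illustrative assumptions rather than details taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Group:
    keywords: str
    lru_time: float      # last-access time of the most recently used document
    relevance: float
    radius: float
    n_documents: int
    size_kb: float


def score(group, weights):
    """Weighted sum of the group metrics; `weights` holds five user constants."""
    a, b, c, d, e = weights
    return (a * group.lru_time + b * group.relevance + c * group.radius
            + d * group.n_documents + e * group.size_kb)


def rank_groups(groups, weights, for_removal=True):
    """Lowest scores first when removing, highest scores first when downloading."""
    return sorted(groups, key=lambda g: score(g, weights), reverse=not for_removal)
```

The same scores therefore serve both directions of cache management; only the sort order changes.
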
[0055] The system of this invention presents the groups in a scrollable, explodable table 20, as shown in Fig. 2. For each group, the user sees the rank 60, distinguishing keywords 62, the number of documents contained within the group 64, the total size of the group (in kilobytes) 66 and a most recent time and date of access of the group 68. For groups with only one document, the list of keywords may be replaced with the document title. Initially, the table 20 is

sorted by the rank 60, but the user may sort the table 20 by the size 66, the number of documents 64 or the time of last access 68. The user changes the sort variable by selecting the heading of the respective column. When a user explodes a group to a first level, the system [...]. When a user explodes the group further, all the documents in the group are displayed sorted by time of access 68. Users may select entire groups or several files within a group for removal.
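
The column sorting and the full explosion of a group can be sketched as follows. The field names, the ascending sort direction and the example rows are assumptions made for illustration. The first-level explosion is omitted because its behavior is not legible in the text.

```python
from datetime import datetime

# Hypothetical column names standing in for items 60, 64, 66 and 68.
SORTABLE_COLUMNS = {"rank", "num_documents", "size_kb", "last_access"}


def sort_table(rows, column="rank"):
    """Return the group rows ordered by the selected column heading.

    Assumption: ascending order; the source does not state a direction.
    """
    if column not in SORTABLE_COLUMNS:
        raise ValueError(f"cannot sort by {column!r}")
    return sorted(rows, key=lambda row: row[column])


def explode_fully(group):
    """List every document of a group, sorted by time of access."""
    return sorted(group["documents"], key=lambda doc: doc["last_access"])


# Hypothetical table rows and one group with its documents.
rows = [
    {"rank": 1, "keywords": ["cache"], "num_documents": 3,
     "size_kb": 120, "last_access": datetime(1998, 4, 1)},
    {"rank": 2, "keywords": ["patent"], "num_documents": 1,
     "size_kb": 40, "last_access": datetime(1998, 3, 1)},
]
group = {
    "documents": [
        {"title": "b", "last_access": datetime(1998, 4, 2)},
        {"title": "a", "last_access": datetime(1998, 1, 5)},
    ]
}
by_size = sort_table(rows, "size_kb")
exploded = explode_fully(group)
```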

[0056] Fig. 6 shows a block diagram of one embodiment of the processor 35 or processing system of this invention. As shown in Fig. 6, the processor 35 is implemented using a general purpose computer 90 comprising a controller 86, a memory 88, a group generator 72, a keyword generator 74, a group score calculator 76, a group expander 78, a document/group selector 80, a storage manager 82 and a group filter 84. These elements of the general purpose computer 90 are interconnected by a bus 70. The group generator 72, the keyword generator 74, the group score calculator 76, the group expander 78, the document/group selector 80, the storage manager 82 and the group filter 84, controlled by controller 86, are used to implement the flow chart of Figs. 4 and 5. It should be appreciated that many other implementations of these elements will be apparent to those skilled in the art.

[0057] The group generator 72 implements the control routine outlined in Fig. 5. The keyword generator 74 generates keywords for each group. The keywords may then be used to indicate the content of the file to a user viewing a group listing. The keyword generator 74 may be a type of document summarizer. The group score calculator 76 generates a group score. An example method for calculating a group score has been detailed above. The group score may be used to rank the groups in a group listing. The group expander 78 expands a selected group into a list of documents contained within the selected group. As detailed earlier, a user may then select individual documents on which to perform various storage management functions. The document selection is performed using document/group selector 80. The storage manager 82 performs various storage management functions on the selected document(s) or group(s). The storage manager can perform many processes that are common to many conventional storage managers, such as deleting, moving, copying or adding a pointer to a do[...] in the background.

[0058] While this invention has been described with the specific embodiments outlined above, many alternatives, modifications and variations are apparent to those skilled in the art. Accordingly, the preferred embodiments described above are illustrative and not limiting. Various changes may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims

1. A document storage management [...] stored on the document storage device, the system comprising:

a group generator that generates at least one group of documents from the plurality of documents based on the content of the plurality of documents;

a selector that is capable of selecting one of the at least one [...]

a storage manager that manages storage of the select[...]

2. The system of claim 1, wherein the group generator generates the at least one group of documents based on the similarity of the content of the plurality of documents.

3. The system of claim 1 or claim 2, wherein the group generator generates at least one group of documents based further on at least one attribute of each of the plurality of documents.

4. A method for managing a docu[...]

grouping a plurality of documents stored in a document storage device into a plurality of groups based on the content of the plurality of documents;

selecting at least one of the plural[...]

managing the storage of the selected at least one [...]

5. The method of claim 4, wherein the grouping is based on the similarity of the content of the plurality of documents.

6. The method of claim 4 or claim 5, wherein the plurality of documents are grouped further based upon at least one attribute of each of the plurality of documents.
7. A graphical user interface for managing the storage of a plurality of documents that are stored in a document storage device, the interface comprising:

a display for displaying information to a user;

at least one selectable group identifier visible on the display, the at least one selectable group identifier representing corresponding groups of documents, wherein the groups of documents comprise the plurality of documents that are grouped based on the content of the plurality of documents, wherein the at least one selectable group identifier is responsive to a selection of the at least one selectable group identifier to select a corresponding at least one group of documents; and

a storage manager responsive to a command to perform a storage management function on the selected at least one group of documents.

8. The interface of claim 7, wherein the plurality of documents are grouped based on the similarity of the content of the plurality of documents.

9. The interface of claim 7 or claim 8, wherein the plurality of documents are grouped further based on at least one attribute of each of the plurality of documents.

10. The interface of any of claims 7 to 9, wherein the at least one selectable group identifier is displayed in a list that is ordered based on at least one group characteristic.